Skip to content

Add FP4 tile + vector types and memory operations#54

Open
ccs1112 wants to merge 3 commits into
HazyResearch:mainfrom
ccs1112:fp4-p1-types-and-memory
Open

Add FP4 tile + vector types and memory operations#54
ccs1112 wants to merge 3 commits into
HazyResearch:mainfrom
ccs1112:fp4-p1-types-and-memory

Conversation

@ccs1112
Copy link
Copy Markdown

@ccs1112 ccs1112 commented Apr 19, 2026

This PR closes #46 and #47. It lands fp4e2m1_2 as the canonical sub-byte tile dtype and adds the three load paths from #47.

FP4 breaks the sizeof(scalar) × packing::num() == sizeof(packed) invariant, so shared storage has to pick between packed and unpacked. Unpacked halves the shared budget and forces ALU nibble-packing on every shared → register load. A new st<fp4e2m1, ...> specialization would be perf-optimal but rewrites a core struct for one dtype. Using fp4e2m1_2 as the dtype, which I gleaned from ThunderKittens, has the same performance and costs only one relaxed static_assert, so I took this route.

rt_shape is type-agnostic, so num() has to count unpacked-type units per packed-type: 2 for FP4. The physical register layout is the same. rt_shape type params can be added as a follow up if so desired.

The crux is the new fp4e2m1_4 branch in shared_to_register.cuh. It's basically the fp8e4m3_4 branch copied with the new types, aliases and some compiler assertions. I don't think global_to_shared.cuh needs any changes because the existing byte math handles sizeof(dtype)==1 when the columns are in pairs.

I wrote a test file with some caveats:

  1. GPU_TARGET=CDNA4 make -j13 compiles clean on MI300X (Hot Aisle, rocm/7.0-preview container)
  2. I don't have MI35x access so if any one is able to run the unit tests for this branch on this device it would be greatly appreciated. I'm waiting to hear back from the AMD dev program on credits and device access.
  3. Main branch is failing with GPU_TARGET=CDNA3. I know main targets CDNA4 so if v3 support has been dropped then this isn't a bug, otherwise the fix for this can be filed separately: Cannot select: intrinsic llvm.amdgcn.raw.buffer.load.lds

ccs1112 added 3 commits April 18, 2026 13:51
Introduce a packing<fp4e2m1_2> specialization (num()=1, self-referential
unpacked_type) and modify packing<fp4e2m1_4>'s num() from 4 to 2 with
unpacked_type = fp4e2m1_2. HipKittens' rt_shape is type-agnostic, so
num() must count unpacked-type units per packed-type element for the
packed_per_thread arithmetic to hold the bf16/half/fp8 pattern.

Update the T1 concept to include fp4e2m1_2 in place of fp4e2m1. Allow
fp4e2m1_4 in rt_base's dtype allowlist. Relax st.cuh's num()==1
assertion to allow fp4e2m1_2, and reject the scalar fp4e2m1 in st and
gl with messages pointing users to the packed types. Add sv_fp4e2m1_2
and rt_fp4e2m1_2 aliases and fp4e2m1_2 <-> float2 convertors.
shared_to_register.cuh: new fp4e2m1_4 branch in the row-layout load at
RT::base_tile_stride == 16 (rt_16x128), mirroring the existing fp8e4m3_4
branch in width (ds_read_b128 with float4 cast). Handles both ST>=RT and
ST<=RT base-tile configurations.

global_to_register.cuh: add fp4e2m1_2 to the existing fp8e4m3 rejection
in all three static_asserts (row-load, col-load, row-store). Direct g->r
isn't used for packed sub-byte types.

Also reject fp4e2m1_2 in shared_to_register.cuh's register->shared store
path (not in PR 1's scope), and reject the scalar fp4e2m1 in the load
path to backstop the higher-level guards in st.cuh and gl.cuh.
testing_utils.cuh: pair-granularity initialize and validate branches
for fp4e2m1_2. i_ref/o_ref are sized in pair units; each pair is
packed as (f, f) so both halves dequantize to the same value.
Tolerance sized to FP4's 16-point grid (absolute 0.5).

fp4_load.cu (new): hand-rolled kernel that loads packed FP4 global ->
shared -> register and dumps each thread's 32 fp4e2m1_4 elements as
128 floats. Host sorts both sides and does multiset comparison with
FP4-grid tolerance. Exercises the new shared_to_register fp4e2m1_4
branch that the existing sharedreg_load_store round-trip can't reach
(register -> shared FP4 store isn't in HazyResearch#47's scope).

global_to_shared.cu: extend the type sweep with fp4e2m1_2, exercising
the g->s and s->g paths for packed FP4 at every supported shape.

testing_flags.cuh: wire TEST_WARP_MEMORY_TILE_FP4_LOAD into the
TEST_ALL_WARP_MEMORY_TILE expansion and the TEST_WARP_MEMORY_TILE
derivation. tile.cu/cuh dispatches the new test. Makefile adds the
flag to the default build.
@ccs1112 ccs1112 changed the title Add FP4 tile + vector types and memory operations (#46, #47) Add FP4 tile + vector types and memory operations Apr 19, 2026
@willhu-jpg
Copy link
Copy Markdown
Collaborator

Thanks! Can we actually work on FP4 in a branch together? I think it'll keep the main branch more stable as we go about things.

@ccs1112
Copy link
Copy Markdown
Author

ccs1112 commented Apr 30, 2026

Sounds good, looks like there's no existing FP4 branch on this upstream. Do you want to make one and I'll retarget the PR? I don't have write access on HazyResearch:main so I would only be able to make a branch on my fork, which would kind of defeat the purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for FP4 tile and vector types in CDNA4/main

2 participants